Indexing Weighted Sequences: Neat and Efficient

Authors

Carl Barton

Tomasz Kociumaka

Chang Liu

Solon P. Pissis

Jakub Radoszewski

Abstract

In a weighted sequence, for every position of the sequence and every letter of the alphabet, a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known as Position-Weight Matrices. Given a probability threshold 1/z, we say that a string P of length m occurs in a weighted sequence X at position i if the product of probabilities of the letters of P at positions i, ..., i+m-1 in X is at least 1/z. In this article, we consider an indexing variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n over a constant-sized alphabet that answers pattern matching queries in optimal O(m+Occ) time, where Occ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of ⌊z⌋ special strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We obtain a weighted index with the same complexities as the most efficient previously known index by Barton et al. [3], but our construction is significantly simpler. The most complex algorithmic tool required in the basic form of our index is the suffix tree, which we use to develop a new, more straightforward index for the so-called property matching problem. We provide an implementation of our data structure. Our construction also allows us to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. [6] and an improvement of the space complexity of their general index.
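The occurrence definition in the abstract can be illustrated with a minimal Python sketch. This is a naive O(nm) scan and not the O(nz) index described by the authors. The representation of X as a list of per-position dictionaries, the function name `occurrences`, and the example data are assumptions made purely for illustration.

```python
from math import prod


def occurrences(X, P, z):
    """Return all 0-based positions i where P occurs in X with probability >= 1/z.

    X is assumed to be a list of dicts, one per position, mapping each letter
    to its probability at that position (missing letters count as probability 0).
    """
    n, m = len(X), len(P)
    threshold = 1.0 / z
    result = []
    for i in range(n - m + 1):
        # Product of the probabilities of P[0..m-1] at positions i..i+m-1.
        p = prod(X[i + j].get(P[j], 0.0) for j in range(m))
        if p >= threshold:
            result.append(i)
    return result


# Hypothetical weighted sequence of length 4 over the alphabet {A, C}.
X = [
    {"A": 1.0},
    {"A": 0.5, "C": 0.5},
    {"A": 0.25, "C": 0.75},
    {"C": 1.0},
]

print(occurrences(X, "AC", 2))  # [0]
print(occurrences(X, "AC", 4))  # [0, 1, 2]
```

With z = 2 only position 0 qualifies (1.0 × 0.5 = 0.5), while relaxing the threshold to z = 4 also admits positions 1 (0.375) and 2 (0.25). The example probabilities are exact binary fractions; with arbitrary floating-point inputs, a comparison exactly at the threshold may need a tolerance.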


Similar resources

Novel Approaches to Biomolecular Sequence Indexing

In many biomolecular database applications involving string/sequence data, it is common to have similarity search in the form of near neighbor queries or nearest neighbor queries. The similarity between strings/sequences is typically measured in terms of the least costly set of allowed edit operations that transform one string/sequence to another. In this survey, we briefly describe some of th...


COMBINING FUZZY QUANTIFIERS AND NEAT OPERATORS FOR SOFT COMPUTING

This paper will introduce a new method to obtain the order weights of the Ordered Weighted Averaging (OWA) operator. We will first show the relation between fuzzy quantifiers and neat OWA operators and then offer a new combination of them. Fuzzy quantifiers are applied for soft computing in modeling the optimism degree of the decision maker. In using neat operators, the ordering of the inputs is not...


I-45: Advance MRI Sequences in Pelvic Endometriosis

Background: To assess MRI in diagnosing endometriotic lesions, emphasizing the efficacy of T2*-weighted imaging. Materials and Methods: This prospective study included 48 females (22-38 years, average 29.6) clinically suspected of endometriosis from September 2009 to April 2012. MRI was performed with a 1.5 T imager (Siemens) with a body array coil. T1-, T2- and T2*-weighted (2D-FLASH) sequences were obtained ...


Indexing Weighted-Sequences in Large Databases

We present an index structure for managing weighted-sequences in large databases. A weighted-sequence is defined as a two-dimensional structure where each element in the sequence is associated with a weight. A series of network events, for instance, is a weighted-sequence in that each event has a timestamp. Querying a large sequence database by events' occurrence patterns is a first step towards...


Efficient Similarity Search for Time Series Data Based on the Minimum Distance

We address the problem of efficient similarity search based on the minimum distance in large time series databases. Most previous work has focused on similarity matching and retrieval of time series based on the Euclidean distance. However, as we demonstrate in this paper, the Euclidean distance has limitations as a similarity measure. It is sensitive to the absolute offsets of time seque...



Journal title:

CoRR

Volume: abs/1704.07625

Publication year: 2017
